Given a continuous range of integers $[1, V]$, we want to generate $n$ distinct numbers from it, uniformly at random.
This is very useful for sampling. For example, when you have an array and want a random subset of its elements, Python's random.sample does exactly this.
Here I will show some ways to generate such a sequence.
Let's consider the simplest algorithm. We can put the entire value range directly into an array and then shuffle it. If we take the first $n$ elements of the shuffled array, we get the desired sequence. Both the time and the memory are $O(V)$, so this only suits a small value range.
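Here is a minimal sketch of this shuffle idea (the function name is mine, just for illustration):

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Method 1 (sketch): fill an array with 1..V, shuffle it, keep the first n values.
// Time and memory are both O(V), so this only suits small value ranges.
std::vector<int> sample_by_shuffle(int n, int V) {
    std::vector<int> a(V);
    std::iota(a.begin(), a.end(), 1);            // a = 1, 2, ..., V
    std::mt19937 rng(std::random_device{}());
    std::shuffle(a.begin(), a.end(), rng);
    a.resize(n);                                 // the first n shuffled values form the sample
    return a;
}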
Actually, there's an even simpler idea, but with a flaw: just randomly pick $n$ numbers and push them into a std::vector without checking anything. The flaw is that the picked numbers may repeat.
Analyzing correctness: assume the current selection is the $i$-th one, so at most $i-1$ distinct numbers have been chosen already. The probability that this selection repeats one of them can be as high as $\frac{i-1}{V}$, which is nonzero, so this simple approach may output duplicates.
It's the same naive idea as before: just generate without worrying about duplicates in advance. What if we encounter a duplicate? Don't worry about it! Remember how we pick a random number when computing quadratic residues? We pick a number at random, and if it doesn't work, we simply pick another one. The only concern is that after many selections, we might run into duplicates again and again.
However, since we are picking numbers randomly, the running time is itself random :) So we have to calculate its expectation.
First, let's solve a similar problem: perform a task repeatedly, where each attempt succeeds with probability $p$, independently of the others. What is the expected number of attempts until the first success?
This is just a geometric series! The expectation becomes $E = \sum_{k \ge 1} k \, p (1-p)^{k-1} = \frac{1}{p}$.
Oops! The expected number of attempts is the inverse of the probability of success! (Although it's a classic result and seems quite intuitive QwQ)
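For completeness, one quick textbook way to see it (not part of the original argument): either the first attempt succeeds, or we have wasted one attempt and are back where we started, so

$$E = p \cdot 1 + (1-p)(E+1) \;\Longrightarrow\; pE = 1 \;\Longrightarrow\; E = \frac{1}{p}.$$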
We want to find the complexity of randomly choosing a number that has not appeared yet, so it suffices to compute an upper bound. Obviously, the maximum complexity occurs when selecting the $n$-th (last) number, whose success probability is $\frac{V-n+1}{V}$. So if we choose $n \le \frac{V}{2}$, every pick succeeds with probability at least $\frac{1}{2}$, the expected number of attempts per pick is at most $2$, and the total expected number of picks is $O(n)$.
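A minimal sketch of this rejection idea, with a balanced BST (std::set) doing the duplicate check (names are mine, not from the final code linked below):

#include <random>
#include <set>
#include <vector>

// Method 2 (sketch): keep drawing uniformly from [1, V] and re-draw on duplicates.
// With n <= V/2 every draw succeeds with probability >= 1/2, so the expected number
// of draws is O(n); each std::set lookup/insert costs O(log n).
std::vector<long long> sample_by_rejection(long long n, long long V) {
    std::mt19937_64 rng(std::random_device{}());
    std::uniform_int_distribution<long long> dist(1, V);
    std::set<long long> used;
    std::vector<long long> res;
    while ((long long)res.size() < n) {
        long long v = dist(rng);
        if (used.insert(v).second)   // .second is true only if v was not there before
            res.push_back(v);
    }
    return res;
}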
However, the above algorithm cannot satisfy all scenarios. For instance, when $n$ is close to $V$, the last picks almost always collide and the bound above no longer helps. Then, similar to the analysis of a single pick, we sum the expected number of attempts over all $n$ selections. Which leads to: $E = \sum_{i=1}^{n} \frac{V}{V-i+1} = V\,(H_V - H_{V-n})$. From this, we derive that in the worst case $n = V$ the expected number of picks is $V \cdot H_V = \Theta(V \log V)$, so the method becomes slow when $n/V$ approaches $1$.
Of course, there's more room for optimization. For instance, the std::set can be replaced by a plain array plus a final std::sort for constant factor optimizations in time and space. This results in a time complexity of $O(n \log n)$ and a space complexity of $O(n)$.
Another optimization concerns the time complexity. Checking whether a number has appeared can be done not only with a BST but also with a hash table (std::unordered_set). This would improve the time complexity, but the constant factors can be very large. In this case, the complexity of the second method would become the expected number of draws, $\sum_{i=1}^{n} \frac{V}{V-i+1}$, with $O(1)$ expected cost per check. Solving this, we get an expected total of $O(n)$ when $n \le \frac{V}{2}$.
Still related to checking whether a number has appeared: the "visited" flags can also be kept in a std::bitset, which can greatly optimize constants. When the value range is small enough that the bitset fits in memory, this is both the simplest and the fastest check.
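For a concrete feel, here is a sketch of the bitset variant; the bound $10^7$ below is only an example value range I picked for the illustration, not a number from the article:

#include <bitset>
#include <random>
#include <vector>

const int MAXV = 10000000;          // example bound on V; 10^7 bits is about 1.25 MB
std::bitset<MAXV + 1> vis;          // global, so the big object does not live on the stack

// Method 2 with a bitset as the "already appeared" check: O(1) per check
// and a much smaller constant than a BST or a hash table.
std::vector<int> sample_small_range(int n, int V) {
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<int> dist(1, V);
    vis.reset();                    // clear marks left by a previous run
    std::vector<int> res;
    while ((int)res.size() < n) {
        int v = dist(rng);
        if (!vis[v]) { vis[v] = 1; res.push_back(v); }
    }
    return res;
}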
In many cases, we don't require such strong randomness. Therefore, we can divide the value range equally into $n$ blocks and randomly pick one number from each block; the outputs are distinct by construction, at the cost of never having two of them in the same block.
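A sketch of this block idea (again, the function name is just for illustration):

#include <random>
#include <vector>

// Divide [1, V] into n nearly equal blocks and pick one number from each block.
// The outputs are distinct by construction, but the sample is less "random":
// no two outputs can ever come from the same block.
std::vector<long long> sample_by_blocks(long long n, long long V) {
    std::mt19937_64 rng(std::random_device{}());
    std::vector<long long> res;
    for (long long i = 0; i < n; ++i) {
        long long lo = V * i / n + 1;          // block i covers [lo, hi]
        long long hi = V * (i + 1) / n;
        res.push_back(std::uniform_int_distribution<long long>(lo, hi)(rng));
    }
    return res;
}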
While I was sleeping I thought of another method, Method 3. I'm not sure if it's correct, so if you think it is wrong, please tell me in the comments. Specifically, we first randomly select $n$ numbers at once (duplicates allowed), sort them and remove the duplicates; then we generate as many new numbers as are still missing and repeat, until we have $n$ distinct numbers.
n/V Exp
0.0001 1
0.0002 1
0.0004 1
0.0008 1
0.0016 1
0.0032 1
0.0064 2
0.0128 2
0.0256 2
0.0512 3
0.1024 4
0.2048 6
0.4096 11
0.8192 47.7
It's evident that when $n/V$ grows past roughly $0.5$, the expected cost rises sharply, so Method 3 is only attractive when the ratio is small.
Attached is the code for generating the table: tester.cpp.
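Here is a minimal sketch of this batch idea (just an illustration of the description above; the actual implementation is in the tester linked here and in UNIG.cpp below):

#include <algorithm>
#include <random>
#include <vector>

// Method 3 (sketch): draw as many numbers as are still missing, sort, drop
// duplicates, and repeat until n distinct values remain.
std::vector<long long> sample_by_batches(long long n, long long V) {
    std::mt19937_64 rng(std::random_device{}());
    std::uniform_int_distribution<long long> dist(1, V);
    std::vector<long long> a;
    while ((long long)a.size() < n) {
        long long need = n - (long long)a.size();
        for (long long i = 0; i < need; ++i) a.push_back(dist(rng));
        std::sort(a.begin(), a.end());
        a.erase(std::unique(a.begin(), a.end()), a.end());
    }
    return a;          // note: the result comes out sorted, not in generation order
}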
After writing the implementation of the method above, I started to wonder: if I replace Method 2 with Method 3, will it be faster?
It seems that in Method 3 the constant becomes too large when the ratio is unfavorable. But compared to the constant of Method 2, is it truly that large? Besides, the ratio can be kept below 0.5, so the constant is at most about 20. Is this really better?
Indeed it is.
Here is the table using only Method 3 (unsigned int): M3table.
The table above shows the performance of Method 3 when the value range fully utilizes unsigned int; even the n/V=0.8 group finishes in reasonable time, so it looks good.
Running the two implementations against each other, Method 3 appears to be slower than the code above in cases where the value range is very small. However, we have the std::bitset approach, which strictly outperforms the hash approach for small value ranges. Combining these methods results in an optimal solution!
For Method 3, we can still optimize! Notice that the numbers generated in earlier rounds are already sorted among themselves; only the newly generated numbers are unordered. So we only need to sort the new numbers and merge them into the old ones. A smaller constant, nice!
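A minimal sketch of this refill step (illustration only; UNIG.cpp linked below is the real implementation):

#include <algorithm>
#include <vector>

// Refill step (sketch): `a` is sorted and duplicate-free from earlier rounds.
// Append the freshly generated numbers, sort only that new tail, merge it in,
// and drop duplicates again.
void merge_new_batch(std::vector<long long>& a, const std::vector<long long>& fresh) {
    std::size_t old_size = a.size();
    a.insert(a.end(), fresh.begin(), fresh.end());
    std::sort(a.begin() + old_size, a.end());               // only the new part needs sorting
    std::inplace_merge(a.begin(), a.begin() + old_size, a.end());
    a.erase(std::unique(a.begin(), a.end()), a.end());
}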
For Optimization 3, there's even more room. If we want to generate such sequences many times, we can reuse the bitset between runs for a better constant: calling std::bitset::reset on an existing bitset is faster than declaring a new one each time.
Here's the final code: UNIG.cpp.
Then we can check whether the program behaves correctly using the program below, which reads the generated numbers (up to $10^7$ values drawn from $[1, 10^9]$) and computes their sample mean and variance:
#include<iostream>
#include<cstdio>
#include<cctype>
#include<vector>
#include<cstring>
#include<algorithm>
#include<ctime>
#include<random>
#include<iomanip>
using namespace std;
#define int long long
int x[10000005];
signed main() {
    // Read the generated sequence from test.in and report its sample mean and
    // sample variance, so we can compare them with the theoretical values.
    freopen("test.in","r",stdin);
    int n,V=1e9;                    // values are expected to lie in [1, V]
    cin>>n;
    for(int i=1;i<=n;i++) cin>>x[i];
    double avg=0;
    double var=0;
    cout.precision(10);
    for(int i=1;i<=n;i++) avg+=x[i];
    avg/=n;                         // sample mean
    cout<<"Average number: "<<avg<<endl;
    for(int i=1;i<=n;i++) var+=(x[i]-avg)*(x[i]-avg);
    cout<<"Variance value: "<<var/(n-1)<<endl;   // unbiased sample variance
    return 0;
}
Result:
Average number: 499849719.6
Variance value: 8.33604178e+16
We know mathematically that for the uniform distribution on $[1, V]$ with $V = 10^9$ the expected mean is $\frac{V+1}{2} \approx 5 \times 10^8$ and the variance is $\frac{V^2-1}{12} \approx 8.33 \times 10^{16}$, which matches the output above nicely.
In this edition I only keep the important content. There is also a Chinese edition of this article, which was written first; it contains more interesting things, but they are not essential.
Thanks to ChatGPT and DeepL. They helped me to improve the English expression of this blog.